Customer segmentation is a key strategy in e-commerce that helps businesses classify customers based on their purchasing behavior.
It enables targeted marketing efforts and supports customer retention strategies.
Traditionally, segmentation was based on demographic attributes.
With the rise of machine learning, data-driven methods like K-Means clustering have become more prevalent.
This study uses K-Means clustering to analyze an online retail dataset and identify customer groups based on purchase behaviors.
Introduction
Recent research emphasizes the effectiveness of clustering algorithms for customer segmentation.
K-Means clustering is favored for its efficiency and ease of implementation (Paramita and Hariguna 2024).
The study adopts K-Means clustering using R to enable a data-driven approach for customer analysis and business decision-making.
Methods
K-Means Clustering Algorithm
The K-Means clustering algorithm is an iterative approach for partitioning a dataset into k clusters. It follows these steps to achieve convergence:
Initialize Cluster Centers:
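Select k initial cluster centroids randomly, or use K-Means++ for better placement. K-Means++ is an improved initialization technique that places the initial centroids far apart to improve convergence and accuracy.
This ensures the centroids are well separated at the start.
Repeat Until Convergence:
Assign each data point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it. Stop when the assignments no longer change.
(Image source: Shiyu Gong)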
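A minimal base R sketch of the procedure: random initialization followed by the standard assign-and-update loop (Lloyd's algorithm). The function name `simple_kmeans` and the toy data are hypothetical and used only for illustration; this is not the code used in the study, and it uses random initialization rather than K-Means++.

```r
# Minimal K-Means (Lloyd's algorithm) for a numeric matrix `x`.
# `simple_kmeans` is a hypothetical helper, not a function from the report.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialization: pick k distinct rows at random as the starting centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from each point to each centroid.
    dists <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_assignment <- max.col(-dists, ties.method = "first")

    # Converged: no point changed cluster.
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Update step: move each centroid to the mean of its assigned points.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids, iterations = iter)
}

set.seed(42)
# Toy data: two well-separated blobs of 20 points each in two dimensions.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
```

In practice the built-in `kmeans()` function from the stats package is the usual choice; the loop above only makes the two alternating steps explicit.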
Selection of k (Number of Clusters)
The optimal number of clusters k is determined using the Elbow Method, which plots the Within-Cluster Sum of Squares (WCSS), a measure of the variance within clusters, against k. The elbow point, where the curve bends and the rate of decrease in WCSS diminishes, indicates the optimal k.
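For reference, WCSS can be written as follows. The notation is an assumption introduced here, not taken from the original: $C_j$ is the set of points assigned to cluster $j$, $\mu_j$ is that cluster's centroid, and $x_i$ is a data point.

```latex
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

When the optimal centroids are used, WCSS does not increase as k grows, which is why the method looks for the bend in the curve rather than its minimum.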
The Online Retail dataset, sourced from the UCI Machine Learning Repository, contains transactional data from a UK-based e-commerce store (Chen 2015). The store specializes in unique giftware, which customers frequently purchase in bulk. The dataset spans transactions recorded between December 1, 2010, and December 9, 2011.
This dataset is particularly useful for customer segmentation, sales analysis, and market trend evaluation. It includes eight attributes that provide insights into customer purchases, product details, and order quantities. The data can be leveraged for analyzing buying behaviors, identifying customer clusters, and predicting future sales trends.
Dataset Overview
The dataset contains transactional records from an online retail store. The key attributes in the dataset include:
InvoiceNo: Unique invoice number for transactions
StockCode: Product code
Description: Product name
Quantity: Quantity purchased
InvoiceDate: Date and time of the purchase
UnitPrice: Price per unit
CustomerID: Unique identifier for customers
Country: Country where the transaction occurred
Modeling and Results
Code
# Data Manipulation and Cleaning
library(tidyverse)
library(readxl)
# Visualization
library(ggplot2)
library(patchwork)
library(gridExtra)
library(ggpubr)
# Data Exploration and Analysis
library(DT)
library(knitr)
library(gtsummary)
library(naniar)
library(VIM)
library(skimr)
# Clustering Analysis
library(factoextra)
library(cluster)
library(clValid)
library(NbClust)
# Load Dataset
file_path <- "~/customer_segmentation_group_project/Online_Retail.xlsx"
df <- read_excel(file_path)
# Display sample rows
kable(head(df), format = "html",
      caption = "First Few Records of the Dataset",
      table.attr = "class='table table-striped table-bordered'")
First Few Records of the Dataset
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7.65 | 17850 | United Kingdom |
Code
# Summary Statistics
skim(df)
Data summary
| Name | df |
| --- | --- |
| Number of rows | 541909 |
| Number of columns | 8 |
| Column type frequency: | |
| character | 4 |
| numeric | 3 |
| POSIXct | 1 |
| Group variables | None |

Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InvoiceNo | 0 | 1 | 6 | 7 | 0 | 25900 | 0 |
| StockCode | 0 | 1 | 1 | 12 | 0 | 4070 | 0 |
| Description | 1454 | 1 | 1 | 35 | 0 | 4211 | 0 |
| Country | 0 | 1 | 3 | 20 | 0 | 38 | 0 |

Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Quantity | 0 | 1.00 | 9.55 | 218.08 | -80995.00 | 1.00 | 3.00 | 10.00 | 80995 | ▁▁▇▁▁ |
| UnitPrice | 0 | 1.00 | 4.61 | 96.76 | -11062.06 | 1.25 | 2.08 | 4.13 | 38970 | ▁▇▁▁▁ |
| CustomerID | 135080 | 0.75 | 15287.69 | 1713.60 | 12346.00 | 13953.00 | 15152.00 | 16791.00 | 18287 | ▇▇▇▇▇ |

Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
| --- | --- | --- | --- | --- | --- | --- |
| InvoiceDate | 0 | 1 | 2010-12-01 08:26:00 | 2011-12-09 12:50:00 | 2011-07-19 17:17:00 | 23260 |
The dataset has 541,909 rows and 8 columns. CustomerID has 135,080 missing values (25% missing rate), and Description has 1,454 missing values. Quantity and UnitPrice show extreme values (e.g., negative quantities, very high prices), indicating the need for cleaning.
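The missing-value counts and rates can also be checked directly in base R. The sketch below uses a small made-up data frame (`toy`) as a stand-in for `df`, then repeats the rate arithmetic with the CustomerID counts reported in the summary.

```r
# Toy stand-in for the report's `df`; the values are made up for illustration.
toy <- data.frame(
  InvoiceNo   = c("536365", "536366", "C536379", "536380"),
  Description = c("LANTERN", NA, "HOLDER", "BOXES"),
  CustomerID  = c(17850, NA, 14527, NA),
  stringsAsFactors = FALSE
)

# Missing count and missing rate per column.
n_missing <- colSums(is.na(toy))
missing_rate <- colMeans(is.na(toy))
print(data.frame(n_missing, missing_rate))

# The same arithmetic with the CustomerID counts from the summary.
customer_id_missing_rate <- round(135080 / 541909, 2)
print(customer_id_missing_rate)
```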
Code
# Visualizing Missing Values
aggr(df, col = c("navyblue", "red"), numbers = TRUE, sortVars = FALSE,
     cex.axis = 0.8, cex.lab = 1.2, cex.numbers = 1.2,
     main = "Missing Data Visualization")
The K-Means cluster visualization (see Applying K-Means Clustering) projects the customer data onto two principal components, Dim1 and Dim2, which together capture approximately 95.6% of the total variance (Dim1: 73.3%, Dim2: 22.3%). The plot shows three customer segments, each with its own color and shape: Cluster 1 (blue circles), Cluster 2 (yellow triangles), and Cluster 3 (grey squares). The clusters are generally well separated, although Clusters 2 and 3 overlap slightly, which suggests that customers near that boundary behave similarly. Because two dimensions retain most of the information in the dataset, the projection supports clear visual interpretation of the segments. These insights can inform targeted marketing, personalized services, and strategic business decisions.
Dim1 and Dim2 are the first two principal components produced by the dimensionality reduction (principal component analysis).
Data Cleaning and Transformation
Cleaning reduces the dataset to valid transactions by removing the 135,080 rows with missing CustomerID, cancelled transactions (InvoiceNo starting with “C”), and entries with non-positive Quantity or UnitPrice.
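The filtering rules can be sketched in base R as follows. The data frame `toy` and the `TotalPrice` column name are assumptions made for illustration; the report's own cleaning code is not shown in this section.

```r
# Toy rows that mimic each problem case; the values are illustrative only.
toy <- data.frame(
  InvoiceNo  = c("536365", "C536379", "536380", "536381", "536382"),
  Quantity   = c(6, -1, 2, 0, 3),
  UnitPrice  = c(2.55, 27.50, 3.39, 1.25, -1.00),
  CustomerID = c(17850, 14527, NA, 15311, 16098),
  stringsAsFactors = FALSE
)

keep <- !is.na(toy$CustomerID) &        # customer must be known
  !startsWith(toy$InvoiceNo, "C") &     # drop cancelled transactions
  toy$Quantity > 0 &                    # drop non-positive quantities
  toy$UnitPrice > 0                     # drop non-positive prices

clean <- toy[keep, ]

# Line total per row (hypothetical column name), useful for a spending metric.
clean$TotalPrice <- clean$Quantity * clean$UnitPrice
print(clean)
```

Only the first toy row survives all four conditions; each of the other rows violates exactly one rule.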
Exploratory Data Analysis
EDA provides insights into the dataset’s distributions and patterns.
Quantity Histogram (A): Most transactions involve quantities between 1 and 5, with a right-skewed distribution.
Price Histogram (B): Most items are priced between 1 and 4 units, also right-skewed.
Top 5 StockCode (C): StockCode 85123A has the highest sales count (~2,000 transactions).
Top 5 Countries (D): The United Kingdom dominates with over 300,000 transactions, followed by EIRE, France, Germany, and Spain.
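The counts behind panels like these can be computed without any plotting library. The sketch below uses a small made-up data frame; the bin edges are arbitrary choices for the example, not those used in the report's figures.

```r
# Illustrative transactions; not taken from the Online Retail data.
toy <- data.frame(
  StockCode = c("85123A", "85123A", "22423", "85123A", "22423", "47566"),
  Quantity  = c(6, 2, 12, 1, 3, 4),
  stringsAsFactors = FALSE
)

# Transaction counts per product, largest first, truncated to the top 5.
top_codes <- head(sort(table(toy$StockCode), decreasing = TRUE), 5)
print(top_codes)

# Bin counts behind a quantity histogram, computed without drawing a plot.
bins <- hist(toy$Quantity, breaks = c(0, 5, 10, 15), plot = FALSE)
print(bins$counts)
```

The same `table()` pattern applied to a Country column would give the per-country transaction counts.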
RFM Analysis and Data Preparation
RFM analysis quantifies customer behavior using three metrics:
Recency: Days since the last purchase (relative to December 10, 2011).
Frequency: Number of transactions per customer.
Monetary: Total spending per customer.
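A minimal base R sketch of the three metrics on a toy transaction table. Counting distinct invoices as "transactions" and the `TotalPrice` column name are assumptions; the report may define them differently.

```r
# Toy cleaned transactions; the values are illustrative only.
tx <- data.frame(
  CustomerID  = c(1, 1, 1, 2, 2, 3),
  InvoiceNo   = c("A1", "A1", "A2", "B1", "B2", "C1"),
  InvoiceDate = as.Date(c("2011-12-01", "2011-12-01", "2011-11-01",
                          "2011-06-15", "2011-03-01", "2011-12-09")),
  TotalPrice  = c(10, 5, 20, 7.5, 2.5, 100),
  stringsAsFactors = FALSE
)

reference_date <- as.Date("2011-12-10")  # reference date stated in the text

# One row per customer with the three RFM metrics.
rfm <- do.call(rbind, lapply(split(tx, tx$CustomerID), function(d) {
  data.frame(
    CustomerID = d$CustomerID[1],
    Recency    = as.numeric(reference_date - max(d$InvoiceDate)),  # days
    Frequency  = length(unique(d$InvoiceNo)),  # distinct invoices (assumption)
    Monetary   = sum(d$TotalPrice)
  )
}))
print(rfm)
```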
Outliers in Recency, Frequency, and Monetary are removed using the IQR method.
Log transformation (log1p) reduces skewness in the RFM metrics.
Data is scaled to ensure equal weighting of features during clustering.
The boxplots show that after transformation, the distributions are more symmetric, though some outliers remain (e.g., high Monetary values).
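The three preparation steps (IQR filtering, log1p transformation, scaling) can be sketched in base R as below. The synthetic `rfm` table, the helper name `within_iqr`, and the conventional 1.5 × IQR fence are assumptions made for illustration.

```r
set.seed(1)
# Synthetic right-skewed RFM table standing in for the real one.
rfm <- data.frame(
  Recency   = round(rexp(200, rate = 1 / 90)),
  Frequency = 1 + rpois(200, lambda = 3),
  Monetary  = round(rlnorm(200, meanlog = 6, sdlog = 1), 2)
)

# TRUE where a value lies inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
# `within_iqr` is a hypothetical helper name.
within_iqr <- function(v) {
  q <- quantile(v, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  v >= q[1] - fence & v <= q[2] + fence
}

# Keep customers that are not outliers on any of the three metrics.
keep <- Reduce(`&`, lapply(rfm, within_iqr))
rfm_trim <- rfm[keep, ]

rfm_log <- log1p(rfm_trim)     # log(1 + x), defined at zero
rfm_scaled <- scale(rfm_log)   # centre each column to mean 0, scale to sd 1

print(round(colMeans(rfm_scaled), 10))
print(apply(rfm_scaled, 2, sd))
```

After scaling, every column has mean 0 and standard deviation 1, so no single metric dominates the Euclidean distances used by K-Means.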
K-Means Clustering Analysis
Determining Optimal k
The Elbow Method is used to determine the optimal number of clusters (k) for K-Means clustering. The plot shows the total within-cluster sum of squares (WSS) for different values of k; WSS decreases as the number of clusters increases. Here the elbow appears at k = 3, so three clusters segment the data well without overfitting. More than three clusters would yield only marginal improvement while adding unnecessary complexity.
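The WSS values behind an elbow plot can be computed with the built-in `kmeans()` function. The sketch below uses synthetic three-group data in place of the scaled RFM features; the range k = 1 to 8 and `nstart = 25` are arbitrary choices for the example.

```r
set.seed(123)
# Synthetic scaled features with three groups, standing in for the RFM data.
x <- scale(rbind(
  matrix(rnorm(150, mean = 0), ncol = 3),
  matrix(rnorm(150, mean = 6), ncol = 3),
  cbind(rnorm(50, 0), rnorm(50, 6), rnorm(50, 0))
))

# Total within-cluster sum of squares for each candidate k.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = k_values, wss = round(wss, 1)))
```

Passing `k_values` and `wss` to `plot(..., type = "b")` draws the elbow curve; the elbow is the k after which further reductions in WSS become small.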
Applying K-Means Clustering
The visualization shows three distinct clusters, with some overlap between clusters 2 and 3. Dim1 (73%) and Dim2 (22%) capture most of the variance, indicating effective dimensionality reduction.
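A sketch of how cluster labels and the Dim1/Dim2 coordinates can be obtained with base R, using `kmeans()` and `prcomp()` on synthetic data. The report loads factoextra, whose `fviz_cluster()` function draws this kind of plot; whether the figure was produced that way is an assumption.

```r
set.seed(123)
# Synthetic scaled features with three groups, standing in for the RFM data.
x <- scale(rbind(
  matrix(rnorm(150, mean = 0), ncol = 3),
  matrix(rnorm(150, mean = 6), ncol = 3),
  cbind(rnorm(50, 0), rnorm(50, 6), rnorm(50, 0))
))
colnames(x) <- c("Recency", "Frequency", "Monetary")

# K-Means with k = 3; nstart = 25 is an arbitrary choice for the example.
km <- kmeans(x, centers = 3, nstart = 25)

# PCA on the already-scaled features; Dim1 and Dim2 are the first two components.
pca <- prcomp(x, center = FALSE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * var_explained[1:2], 1))  # percent of variance per dimension

# Coordinates one would plot, with the cluster label for each row.
coords <- data.frame(Dim1 = pca$x[, 1], Dim2 = pca$x[, 2],
                     Cluster = factor(km$cluster))
print(table(coords$Cluster))
```

The percentages printed for the synthetic data will differ from the 73% and 22% reported for the real dataset.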
Cluster Profiling
Cluster Summary
| Cluster | Avg_Recency | Avg_Frequency | Avg_Monetary | Total_Customers |
| --- | --- | --- | --- | --- |
| 1 | 56.62587 | 2.970424 | 5.558872 | 1585 |
| 2 | 36.88193 | 4.776759 | 7.334640 | 1677 |
| 3 | 256.41873 | 2.603139 | 5.125849 | 929 |
Insights
Cluster 1 (Green): Moderate recency (~57 days), low frequency (~3 transactions), and moderate spending (~5.56 log units). These are occasional buyers.
Cluster 2 (Yellow): Low recency (~37 days), high frequency (~4.78 transactions), and high spending (~7.33 log units). These are loyal, high-value customers.
Cluster 3 (Blue): High recency (~256 days), low frequency (~2.60 transactions), and low spending (~5.13 log units). These are inactive customers at risk of churn.
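Since spending is reported in log units, it can help to map the averages back to the original currency scale. The sketch below assumes that Avg_Monetary is the mean of log1p(Monetary); if so, `expm1()` inverts the transformation, and the result behaves like a geometric mean rather than an arithmetic average of spending.

```r
# Average log-scale spending per cluster, as reported in the Cluster Summary.
profile <- data.frame(
  Cluster      = 1:3,
  Avg_Monetary = c(5.558872, 7.334640, 5.125849)
)

# Invert log1p: expm1(y) = exp(y) - 1.
# Assumption: Avg_Monetary was averaged on the log1p scale.
profile$Approx_Monetary <- expm1(profile$Avg_Monetary)
print(profile)
```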
Conclusion
K-Means clustering identified three customer segments:
Occasional Buyers (Cluster 1): Customers who purchase infrequently with moderate spending.
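Loyal Customers (Cluster 2): Frequent buyers with high spending, ideal for retention strategies.
Inactive Customers (Cluster 3): Customers who have not purchased recently, with low frequency and low spending, at risk of churn.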
These segments enable targeted marketing: loyalty programs for Cluster 2, re-engagement offers for Cluster 3, and promotional deals for Cluster 1.
This study demonstrates the effectiveness of K-Means clustering for customer segmentation in e-commerce. By applying RFM analysis and K-Means clustering to the Online Retail dataset, we identified three distinct customer segments with actionable insights for marketing strategies. Future work could explore alternative clustering methods (e.g., DBSCAN) or incorporate additional features like product categories to enhance segmentation.
References
Bradley, Paul S, Kristin P Bennett, and Ayhan Demiriz. 2000. “Constrained k-Means Clustering.” Microsoft Research, Redmond 20 (0): 0.
Paramita, Adi Suryaputra, and Taqwa Hariguna. 2024. “Comparison of k-Means and DBSCAN Algorithms for Customer Segmentation in e-Commerce.” Journal of Digital Market and Digital Currency 1 (1): 43–62.
Tabianan, Kayalvily, Shubashini Velu, and Vinayakumar Ravi. 2022. “K-Means Clustering Approach for Intelligent Customer Segmentation Using Customer Purchase Behavior Data.” Sustainability 14 (12): 7243.